MS5130 Assignment_3

Author

Vartika Srivastava

Welcome to Assignment 3 for MS5130: Advanced Analysis with R

In this assignment, I showcase the proficiency and insights I have gained through my coursework in R programming. We will apply statistical concepts to real-world datasets to demonstrate the power and versatility of R in solving analytical challenges, covering data manipulation, visualization, and statistical modeling.

The enhancements used in my assignment are listed below 📢
  • (BE1) Executing the R code inside Quarto

  • (BE2) Use multiple datasets

  • (BE3) Combine datasets together

  • (BE4) Synergy of quantitative and qualitative analysis

  • (BE5) Explanatory text

  • (SE1) Depict your data streams using Mermaid

  • (SE2) Use of a private GitHub repository

  • (SE3) Use of geographical data analysis using Leaflet

  • (SE4) Use of interactive charts/graphs/plots

Selecting and Combining Datasets

The datasets are the Amazon sale report (amazon_sale_report), carried over from an earlier assignment (Assignment 1), and two new datasets chosen for this assignment: Product_dataset and amazon_products.

The code begins by loading three CSV files: ‘Amazon Sale Report.csv’, ‘Products.csv’, and ‘Amazon-Products.csv’. This is done using the read.csv function, with parameters set to ensure that the header row is recognized and string data is not automatically converted into factors.

To combine the datasets, the inner_join function from the dplyr package merges them on common columns. Working with several datasets (BE2) and combining them (BE3) integrates data from multiple sources and gives a more comprehensive view of the information.

First, Product_dataset is joined with amazon_sale_report using ‘Category’ as the key, so that sales data can be explored within specific product categories. The resulting dataset is then joined with amazon_products using ‘index’ as the key.

The final dataset is combined_sale_products.
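To make the effect of an inner join concrete, here is a minimal sketch on two toy tables. The table contents are hypothetical, and base R's merge() is used instead of dplyr's inner_join so the sketch has no package dependency; with default arguments it also keeps only the keys found in both tables.

```r
# Toy stand-ins for the product and sales tables (hypothetical values)
products <- data.frame(
  Category = c("Kurta", "Set", "Top"),
  Rating   = c(4.1, 3.8, 4.4)
)
sales <- data.frame(
  Category = c("Set", "Set", "Kurta", "Saree"),
  Amount   = c(647.62, 534.00, 406.00, 399.00)
)

# Inner join: keep only rows whose Category appears in both tables
joined <- merge(products, sales, by = "Category")
print(joined)

# "Set" matches two sales rows, so its product attributes are repeated;
# "Top" and "Saree" have no partner in the other table and are dropped
nrow(joined)
```

When the join key is not unique on one side, rows from the other side are repeated for each match, which is one possible reason a joined table shows the same product attributes on several consecutive rows.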

# Load libraries
library(readr)
library(dplyr)

# Load the datasets
amazon_sale_report <- read.csv("dataset/Amazon Sale Report.csv", header = TRUE, stringsAsFactors = FALSE)
Product_dataset <- read.csv("dataset/Products.csv", header = TRUE, stringsAsFactors = FALSE)
amazon_products <- read.csv("dataset/Amazon-Products.csv", header = TRUE, stringsAsFactors = FALSE)

# Combine Product_dataset and amazon_sale_report on 'Category'
combined_Product_sale <- inner_join(Product_dataset, amazon_sale_report, by = "Category")

# Combine the result with amazon_products on 'index'
combined_sale_products <- inner_join(combined_Product_sale, amazon_products, by = "index")

# View the dataset
head(combined_sale_products)
     id                 Category Rating maincateg platform actprice1 norating1
1  2242  Casuals For Men  (Blue)    3.8       Men Flipkart       999     27928
2  2242  Casuals For Men  (Blue)    3.8       Men Flipkart       999     27928
3  2242  Casuals For Men  (Blue)    3.8       Men Flipkart       999     27928
4 20532 Women Black Flats Sandal    3.9     Women Flipkart       499      3015
5 20532 Women Black Flats Sandal    3.9     Women Flipkart       499      3015
6 20532 Women Black Flats Sandal    3.9     Women Flipkart       499      3015
  noreviews1 star_5f star_4f star_3f star_2f star_1f fulfilled1 index
1       3543   14238    4295    3457    1962    3976          1     0
2       3543   14238    4295    3457    1962    3976          1     0
3       3543   14238    4295    3457    1962    3976          1  3190
4        404    1458     657     397     182     321          1     1
5        404    1458     657     397     182     321          1     1
6        404    1458     657     397     182     321          1    39
             Order.ID      Date                       Status Fulfilment
1 405-8078784-5731545 4/30/2022                    Cancelled   Merchant
2 405-8078784-5731545 4/30/2022                    Cancelled   Merchant
3 171-8820044-0821917 4/28/2022                      Shipped     Amazon
4 171-9198151-1101146 4/30/2022 Shipped - Delivered to Buyer   Merchant
5 171-9198151-1101146 4/30/2022 Shipped - Delivered to Buyer   Merchant
6 403-4242957-8599555 4/30/2022                      Shipped     Amazon
  Sales.Channel ship.service.level   Style             SKU Size       ASIN
1     Amazon.in           Standard  SET389  SET389-KR-NP-S    S B09KXVBD7Z
2     Amazon.in           Standard  SET389  SET389-KR-NP-S    S B09KXVBD7Z
3     Amazon.in          Expedited JNE3463  JNE3463-KR-XXL  XXL B08RP77NWN
4     Amazon.in           Standard JNE3781 JNE3781-KR-XXXL  3XL B09K3WFS32
5     Amazon.in           Standard JNE3781 JNE3781-KR-XXXL  3XL B09K3WFS32
6     Amazon.in          Expedited JNE3405    JNE3405-KR-L    L B081WSCKPQ
  Courier.Status Qty currency Amount          ship.city  ship.state
1                  0      INR 647.62             MUMBAI MAHARASHTRA
2                  0      INR 647.62             MUMBAI MAHARASHTRA
3        Shipped   1      INR 534.00              THANE MAHARASHTRA
4        Shipped   1      INR 406.00          BENGALURU   KARNATAKA
5        Shipped   1      INR 406.00          BENGALURU   KARNATAKA
6        Shipped   1      INR 399.00 THIRUVANANTHAPURAM      KERALA
  ship.postal.code ship.country
1           400081           IN
2           400081           IN
3           400607           IN
4           560085           IN
5           560085           IN
6           695011           IN
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           promotion.ids
1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
4 Amazon PLCC Free-Financing Universal Merchant AAT-WNKTBO3K27EJC,Amazon PLCC Free-Financing Universal Merchant AAT-QX3UCCJESKPA2,Amazon PLCC Free-Financing Universal Merchant AAT-5QQ7BIYYQEDN2,Amazon PLCC Free-Financing Universal Merchant AAT-DSJ2QRXXWXVMQ,Amazon PLCC Free-Financing Universal Merchant AAT-CXJHMC2YJUK76,Amazon PLCC Free-Financing Universal Merchant AAT-CC4FAVTYR4X7C,Amazon PLCC Free-Financing Universal Merchant AAT-XXRCW6NZEPZI4,Amazon PLCC Free-Financing Universal Merchant AAT-CXNSLNBROFDW4,Amazon PLCC Free-Financing Universal Merchant AAT-R7GXNZWISTRFA,Amazon PLCC Free-Financing Universal Merchant AAT-WSJLDN3X7KEMO,Amazon PLCC Free-Financing Universal Merchant AAT-VL6FGQVGQVXUS,Amazon PLCC Free-Financing Universal Merchant AAT-EOKPWFWYW7Y6I,Amazon PLCC Free-Financing Universal Merchant AAT-ZYL5UPUNW6T62,Amazon PLCC Free-Financing Universal Merchant AAT-XVPICCHRWDCAI,Amazon PLCC Free-Financing Universal Merchant AAT-ETXQ3XXWMRXBG,Amazon PLCC Free-Financing Universal Merchant AAT-7X3XCTYG64VBE,Amazon PLCC Free-Financing Universal Merchant AAT-7CHGD3WTS3MHM,Amazon PLCC Free-Financing Universal Merchant AAT-26ZDKNME27X42,Amazon PLCC Free-Financing Universal Merchant AAT-4ZF5KN6E4LJK4,Amazon PLCC Free-Financing Universal Merchant AAT-7RCXIKUAX7DDY,Amazon PLCC Free-Financing Universal Merchant AAT-BRSZZ45H6MHAO,Amazon PLCC Free-Financing Universal Merchant AAT-MKLXOOZWQL7GO,Amazon PLCC Free-Financing Universal Merchant AAT-CB7UNXEXGIJTC,Amazon PLCC Free-Financing Universal Merchant #MP-gzasho-1593152694811,Amazon PLCC Free-Financing Universal Merchant AAT-WLBA4GZ52EAH4
5 Amazon PLCC Free-Financing Universal Merchant AAT-WNKTBO3K27EJC,Amazon PLCC Free-Financing Universal Merchant AAT-QX3UCCJESKPA2,Amazon PLCC Free-Financing Universal Merchant AAT-5QQ7BIYYQEDN2,Amazon PLCC Free-Financing Universal Merchant AAT-DSJ2QRXXWXVMQ,Amazon PLCC Free-Financing Universal Merchant AAT-CXJHMC2YJUK76,Amazon PLCC Free-Financing Universal Merchant AAT-CC4FAVTYR4X7C,Amazon PLCC Free-Financing Universal Merchant AAT-XXRCW6NZEPZI4,Amazon PLCC Free-Financing Universal Merchant AAT-CXNSLNBROFDW4,Amazon PLCC Free-Financing Universal Merchant AAT-R7GXNZWISTRFA,Amazon PLCC Free-Financing Universal Merchant AAT-WSJLDN3X7KEMO,Amazon PLCC Free-Financing Universal Merchant AAT-VL6FGQVGQVXUS,Amazon PLCC Free-Financing Universal Merchant AAT-EOKPWFWYW7Y6I,Amazon PLCC Free-Financing Universal Merchant AAT-ZYL5UPUNW6T62,Amazon PLCC Free-Financing Universal Merchant AAT-XVPICCHRWDCAI,Amazon PLCC Free-Financing Universal Merchant AAT-ETXQ3XXWMRXBG,Amazon PLCC Free-Financing Universal Merchant AAT-7X3XCTYG64VBE,Amazon PLCC Free-Financing Universal Merchant AAT-7CHGD3WTS3MHM,Amazon PLCC Free-Financing Universal Merchant AAT-26ZDKNME27X42,Amazon PLCC Free-Financing Universal Merchant AAT-4ZF5KN6E4LJK4,Amazon PLCC Free-Financing Universal Merchant AAT-7RCXIKUAX7DDY,Amazon PLCC Free-Financing Universal Merchant AAT-BRSZZ45H6MHAO,Amazon PLCC Free-Financing Universal Merchant AAT-MKLXOOZWQL7GO,Amazon PLCC Free-Financing Universal Merchant AAT-CB7UNXEXGIJTC,Amazon PLCC Free-Financing Universal Merchant #MP-gzasho-1593152694811,Amazon PLCC Free-Financing Universal Merchant AAT-WLBA4GZ52EAH4
6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
    B2B fulfilled.by
1 FALSE    Easy Ship
2 FALSE    Easy Ship
3 FALSE             
4 FALSE    Easy Ship
5 FALSE    Easy Ship
6 FALSE             
                                                                                                                           name
1 Lloyd 1.5 Ton 3 Star Inverter Split Ac (5 In 1 Convertible, Copper, Anti-Viral + Pm 2.5 Filter, 2023 Model, White, Gls18I3...
2 Pigeon by Stovekraft Amaze Plus Electric Kettle (14289) with Stainless Steel Body, 1.5 litre, used for boiling Water, maki...
3                                                         IFB Filterless ChimneyGL-HC-105-60) with Handsensor & Easy Heat Clean
4 LG 1.5 Ton 5 Star AI DUAL Inverter Split AC (Copper, Super Convertible 6-in-1 Cooling, HD Filter with Anti-Virus Protectio...
5 Pigeon Polypropylene Mini Handy and Compact Chopper with 3 Blades for Effortlessly Chopping Vegetables and Fruits for Your...
6  Panasonic 1.5 Ton 3 Star Wi-Fi Inverter Smart Split AC (Copper Condenser, 7 in 1 Convertible with additional AI Mode, PM ...
  main_category     sub_category
1    appliances Air Conditioners
2    appliances   All Appliances
3    appliances   All Appliances
4    appliances Air Conditioners
5    appliances   All Appliances
6    appliances Air Conditioners
                                                                                             image
1                                   https://m.media-amazon.com/images/I/31UISB90sYL._AC_UL320_.jpg
2 https://m.media-amazon.com/images/W/IMAGERENDERING_521856-T1/images/I/51DGcy8eBCL._AC_UL320_.jpg
3                                   https://m.media-amazon.com/images/I/61X5d95OHcL._AC_UL320_.jpg
4                                   https://m.media-amazon.com/images/I/51JFb7FctDL._AC_UL320_.jpg
5 https://m.media-amazon.com/images/W/IMAGERENDERING_521856-T1/images/I/51RXzjrUmkL._AC_UL320_.jpg
6                                   https://m.media-amazon.com/images/I/61PWjQFDtQL._AC_UL320_.jpg
                                                                                                                                         link
1           https://www.amazon.in/Lloyd-Inverter-Convertible-Anti-Viral-GLS18I3FWAMC/dp/B0BRKXTSBT/ref=sr_1_4?qid=1679134237&s=kitchen&sr=1-4
2                                 https://www.amazon.in/Pigeon-Amaze-Plus-1-5-Ltr/dp/B07WMS7TWB/ref=sr_1_1?qid=1679135585&s=appliances&sr=1-1
3   https://www.amazon.in/IFB-Auto-Clean-GL-HC-105-60-Filterless-Technology/dp/B08PGY65PQ/ref=sr_1_3191?qid=1679135774&s=appliances&sr=1-3191
4              https://www.amazon.in/LG-Convertible-Anti-Virus-Protection-RS-Q19YNZE/dp/B0BQ3MXML8/ref=sr_1_5?qid=1679134237&s=kitchen&sr=1-5
5                  https://www.amazon.in/Pigeon-Stovekraft-Plastic-Chopper-Blades/dp/B01LWYDEQ7/ref=sr_1_2?qid=1679135585&s=appliances&sr=1-2
6 https://www.amazon.in/Panasonic-Convertible-additional-Purification-CU-SU18YKYWT/dp/B0BRJ5QH8G/ref=sr_1_46?qid=1679134239&s=kitchen&sr=1-46
  ratings no_of_ratings discount_price actual_price
1     4.2         2,255        ₹32,999      ₹58,990
2     3.9       128,941           ₹599       ₹1,245
3       2             1        ₹29,078      ₹34,490
4     4.2         2,948        ₹46,490      ₹75,990
5     4.1       274,505           ₹199         ₹545
6     4.2         1,830        ₹37,990      ₹55,400

Enhancements Used: BE1, BE2, BE3, BE5

Quantitative Analysis

GLM MODEL

The code begins by loading the stats package, which ships with base R (and is attached by default) and provides a broad array of statistical functions, including glm for fitting generalized linear models (GLMs).

The pre-processing step involves converting the ‘Status’ variable into a binary format, where the status “Cancelled” is represented as 1, and all other statuses are represented as 0.

A GLM is then fitted to the pre-processed data, with the binary ‘Cancelled’ variable as the response and the numerical variables ‘Amount’ and ‘Qty’ (quantity) as predictors.

The glm function is called with family = binomial, which fits a logistic regression model. This type of model is chosen because the response variable is binary, and the goal is to understand how the amount and quantity of an order relate to the likelihood of that order being cancelled.  
This analysis demonstrates (BE4).

# Import libraries
library(stats)

# Preprocess the data
# Convert the Status to a binary variable (1 for Cancelled, 0 otherwise)
combined_sale_products <- combined_sale_products %>%
  mutate(Cancelled = if_else(Status == "Cancelled", 1, 0))

# Fit a GLM 
glm_model <- glm(Cancelled ~ Amount + Qty, data = combined_sale_products, family = binomial)

# Summary of the model
summary(glm_model)

Call:
glm(formula = Cancelled ~ Amount + Qty, family = binomial, data = combined_sale_products)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  3.228e+00  1.042e-01  30.976  < 2e-16 ***
Amount       3.182e-04  5.039e-05   6.314 2.72e-10 ***
Qty         -6.344e+00  9.975e-02 -63.601  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 57298  on 100332  degrees of freedom
Residual deviance: 40151  on 100330  degrees of freedom
  (7230 observations deleted due to missingness)
AIC: 40157

Number of Fisher Scoring iterations: 6
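
The logistic coefficients above are on the log-odds scale, which is hard to read directly. A small sketch of converting them to odds ratios follows; it simply plugs in the estimates reported in the summary, and the variable names and the 100-unit step for Amount are illustrative choices, not part of the original analysis.

```r
# Estimates copied from the summary output above (log-odds scale)
b_amount <- 3.182e-04
b_qty    <- -6.344

# Odds ratio for a 100-unit increase in Amount (step size chosen for illustration)
or_amount_100 <- exp(b_amount * 100)

# Odds ratio for one additional unit of Qty
or_qty <- exp(b_qty)

round(c(amount_per_100 = or_amount_100, qty_per_unit = or_qty), 4)
```

An odds ratio just above 1 for Amount means the odds of cancellation rise only slightly with order value, while an odds ratio close to 0 for Qty means the odds of cancellation fall sharply as quantity increases. In the rows printed earlier, the cancelled orders have Qty equal to 0, which may account for much of that effect.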

Poisson GLM Model

Switching gears, this section fits a Poisson generalized linear model (GLM), demonstrating (SE5). It explores the relationship between the quantity of products ordered (Qty) and the amount of those orders (Amount) using Poisson regression.

The Poisson GLM is particularly suited to modeling count data, where the response variable represents counts or numbers of events (in this case, the quantity of products ordered, i.e. Qty).
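
As a rough illustration of how a Poisson GLM is read, the sketch below fits one to simulated counts. The data, the true slope of 0.3, and all variable names are assumptions made up for the example; it is not a re-analysis of the sales data.

```r
set.seed(1)

# Simulated counts whose log-mean is linear in x (true slope 0.3, hypothetical)
n <- 2000
x <- runif(n, 0, 2)
y <- rpois(n, lambda = exp(0.5 + 0.3 * x))

# Poisson GLM with the default log link
fit <- glm(y ~ x, family = poisson())

# exp(slope) is the multiplicative change in the expected count per unit of x
rate_ratio <- exp(coef(fit)[["x"]])

# Pearson dispersion: values far from 1 suggest the Poisson variance assumption is off
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)

c(rate_ratio = rate_ratio, dispersion = dispersion)
```

Because of the log link, coefficients are interpreted multiplicatively, and checking the dispersion is a common sanity check before trusting the reported standard errors.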

# Assuming Qty is a count response and Amount is a predictor

# Fit a Poisson GLM 
poisson_model <- glm(Qty ~ Amount, data = combined_sale_products, family = poisson())

# Summary of the model
summary(poisson_model)

Call:
glm(formula = Qty ~ Amount, family = poisson(), data = combined_sale_products)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.776e-02  7.851e-03  -3.536 0.000406 ***
Amount      -6.976e-06  1.135e-05  -0.615 0.538846    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 6979.0  on 100332  degrees of freedom
Residual deviance: 6978.6  on 100331  degrees of freedom
  (7230 observations deleted due to missingness)
AIC: 200937

Number of Fisher Scoring iterations: 4

Gaussian GLM model

In this stage of the analysis, which demonstrates (SE5), I model the relationship between the order amount (Amount), the quantity of products ordered (Qty), and the order status (Status) using a Gaussian generalized linear model (GLM).

Before fitting the model, I ensure that the Status variable is treated as a categorical factor by converting it with as.factor(). This matters because Status represents different categories of order status, and treating it as a factor lets the model handle it as a nominal variable with discrete levels rather than as a numeric variable. The model aims to predict the order amount from the quantity of products ordered and the status of the order.
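
What the factor conversion does for the model can be seen in a small sketch. The status values below are a hypothetical sample, and model.matrix() is used only to display the indicator columns that a model formula would build.

```r
# Hypothetical order statuses, stored as plain character strings
status <- c("Cancelled", "Shipped", "Shipped - Delivered to Buyer", "Shipped")

# as.factor() records the distinct values as levels (sorted alphabetically)
status_f <- as.factor(status)
levels(status_f)

# model.matrix() shows the 0/1 indicator columns a model formula would use;
# the first level ("Cancelled") becomes the reference category
design <- model.matrix(~ status_f)
colnames(design)
```

Each non-reference level gets its own coefficient, interpreted as a difference from the reference level, which is why a summary for such a model lists one row per status other than the baseline.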

# Convert Status to a factor if it's not already
combined_sale_products$Status <- as.factor(combined_sale_products$Status)

# Fit a Gaussian GLM 
gaussian_model <- glm(Amount ~ Qty + Status, data = combined_sale_products, family = gaussian())

# Summary of the model
summary(gaussian_model)

Call:
glm(formula = Amount ~ Qty + Status, family = gaussian(), data = combined_sale_products)

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         652.607      4.618 141.316  < 2e-16 ***
Qty                                  35.011      5.739   6.100 1.06e-09 ***
StatusShipped                       -54.038      3.999 -13.513  < 2e-16 ***
StatusShipped - Delivered to Buyer  -71.430      4.258 -16.775  < 2e-16 ***
StatusShipped - Lost in Transit    -687.618     18.094 -38.003  < 2e-16 ***
StatusShipped - Rejected by Buyer  -119.618    140.393  -0.852  0.39420    
StatusShipped - Returned to Seller  -20.781      7.889  -2.634  0.00844 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 78781.35)

    Null deviance: 8029925327  on 100332  degrees of freedom
Residual deviance: 7903817248  on 100326  degrees of freedom
  (7230 observations deleted due to missingness)
AIC: 1415939

Number of Fisher Scoring iterations: 2

Comparing the Akaike Information Criterion (AIC) values for the three generalized linear models (GLMs): Binomial, Poisson, and Gaussian

AIC is a measure of the relative quality of statistical models for a given set of data. Lower AIC values generally indicate a better trade-off between goodness of fit and model complexity.
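
For reference, AIC is computed as 2k minus twice the log-likelihood, where k is the number of estimated parameters. The sketch below checks that formula against R's AIC() on the built-in mtcars data; the dataset, the two models, and the helper manual_aic are illustrative assumptions, not part of this assignment's analysis.

```r
# Two candidate models for the same response (am) in the built-in mtcars data
fit_small <- glm(am ~ wt, data = mtcars, family = binomial)
fit_large <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# AIC = 2k - 2 * log-likelihood, where k is the number of estimated parameters
manual_aic <- function(fit) {
  ll <- logLik(fit)
  2 * attr(ll, "df") - 2 * as.numeric(ll)
}

# The manual value should agree with R's built-in AIC()
c(manual = manual_aic(fit_small), builtin = AIC(fit_small))

# Compare both models, which are fitted to the same response
AIC(fit_small, fit_large)
```

Both models in the sketch share one response variable, which is the setting in which AIC values are directly comparable.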

library(ggplot2)

# Compare models using AIC
aic_values <- data.frame(
  Model = c("Binomial", "Poisson", "Gaussian"),
  AIC = c(AIC(glm_model), AIC(poisson_model), AIC(gaussian_model))
)

# Plotting
ggplot(aic_values, aes(x = Model, y = AIC, fill = Model)) +
  geom_bar(stat = "identity") +
  labs(title = "Model AIC Comparison") +
  theme_minimal()

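The bar plot compares the AIC values of the three generalized linear models. The Binomial model has the lowest AIC (40157), followed by the Poisson (200937) and Gaussian (1415939) models. This ranking should be read with caution: the three models have different response variables (Cancelled, Qty, and Amount), so their likelihoods are not on a common scale, and the lowest AIC here is not evidence that the Binomial model is the better model. AIC is best used to choose among candidate models for the same response.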

Linear Regression Model

The code below fits a linear regression model to understand the impact of fulfillment status and the number of reviews on product ratings.

The initial steps prepare the combined_sale_products dataset for analysis. This includes filtering out rows with missing values in specific columns (Rating, fulfilled1, and noreviews1) and converting fulfilled1 into a factor.

The model includes an interaction between fulfilled1 and noreviews1, which allows examining how the relationship between the number of reviews (noreviews1) and the rating (Rating) changes with the fulfillment status (fulfilled1).  
This analysis demonstrates (BE4).
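
How an interaction term is specified and read can be sketched on simulated data. Everything below (the group variable g, the predictor x, and the chosen effect sizes) is hypothetical and only mirrors the structure of a model with a two-level factor interacting with a numeric predictor.

```r
set.seed(42)

# Simulated stand-in data: a 0/1 group and a numeric predictor (hypothetical)
n <- 200
g <- factor(rep(c(0, 1), each = n / 2))
x <- runif(n, 0, 10)

# Group 1 has a higher baseline and a flatter slope than group 0
y <- 3 + 0.5 * (g == 1) + 0.2 * x - 0.15 * (g == 1) * x + rnorm(n, sd = 0.3)

# g * x is shorthand for g + x + g:x
fit <- lm(y ~ g * x)
coef(fit)

# Slope of x within each group: the interaction term is the difference in slopes
slope_g0 <- coef(fit)[["x"]]
slope_g1 <- coef(fit)[["x"]] + coef(fit)[["g1:x"]]
c(group0 = slope_g0, group1 = slope_g1)
```

The interaction coefficient is therefore not a slope on its own; it is the amount by which the slope of the numeric predictor differs between the two groups.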

# Proceed with data cleaning and model creation
# Filter out rows with missing values in these columns and convert 'fulfilled1' to a factor
dataset_clean <- combined_sale_products %>%
  filter(!is.na(Rating), !is.na(fulfilled1), !is.na(noreviews1)) %>%
  mutate(fulfilled1 = as.factor(fulfilled1)) # Convert to factor for the interaction term

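# Create a linear regression model with an interaction between fulfilled1 and noreviews1
# Note: the model is fitted on combined_sale_products rather than dataset_clean, so
# fulfilled1 enters as a numeric variable and lm() drops rows with missing values itself
model <- lm(Rating ~ fulfilled1 * noreviews1, data = combined_sale_products)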

# Summary of the model
summary(model)

Call:
lm(formula = Rating ~ fulfilled1 * noreviews1, data = combined_sale_products)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9251 -0.1344 -0.0187  0.1671  1.0749 

Coefficients:
                        Estimate Std. Error  t value Pr(>|t|)    
(Intercept)            3.925e+00  1.404e-03 2796.031  < 2e-16 ***
fulfilled1             1.097e-01  1.869e-03   58.680  < 2e-16 ***
noreviews1             9.448e-06  3.558e-06    2.655  0.00793 ** 
fulfilled1:noreviews1 -1.798e-05  3.785e-06   -4.751 2.03e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2876 on 103058 degrees of freedom
  (4501 observations deleted due to missingness)
Multiple R-squared:  0.03332,   Adjusted R-squared:  0.0333 
F-statistic:  1184 on 3 and 103058 DF,  p-value: < 2.2e-16

# Create the interaction plot with ggplot2
ggplot(dataset_clean, aes(x = noreviews1, y = Rating, color = as.factor(fulfilled1))) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, aes(group = fulfilled1)) + # Ensure group aesthetic uses the factor
  scale_color_manual(values = c("blue", "red")) + # Manually specify colors for the factor levels
  labs(title = "Interaction Effect of Fulfillment and Number of Reviews on Rating",
       x = "Number of Reviews",
       y = "Rating",
       color = "Fulfilled by Platform") +
  theme_minimal() # Use a minimal theme for the plot

The plot shows nearly flat, almost parallel trend lines between the number of reviews and product ratings for both platform-fulfilled and other products, suggesting that review volume has a negligible practical effect on ratings. Ratings cluster at the high end, indicating general customer satisfaction or a rating-scale bias. The model summary tells a consistent story about magnitude: with over 100,000 rows every term is statistically significant, but the fulfilled1 coefficient is only about 0.11 rating points and the R-squared is about 0.033, so fulfillment status and review count explain very little of the variation in ratings.

Enhancements Used in Quantitative Analysis: BE1, BE4, BE5, SE5

Qualitative Analysis

Text Mining

The code below applies text mining techniques in R (BE4) to process and visualize the text in the ship.state column of the combined_sale_products dataset. This is the ‘ship-state’ column of the original CSV; read.csv replaces the hyphen with a dot.

The text mining process focuses on extracting insights from this column, which contains text data on shipping locations. By normalizing and pre-processing the text, I remove irrelevant characters and standardize the values, so that variations in formatting do not skew the analysis.

The resulting word cloud is a visual representation of how often each state appears in the data: the size of each word corresponds to its frequency in the corpus. It is a popular way to highlight the most prominent elements in textual data, providing an immediate visual summary of the text’s content.
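
The frequencies that drive the word sizes can be sketched in base R without any text mining package. The state values below are a hypothetical sample with deliberately inconsistent formatting, and the cleaning steps only approximate what the tm transformations do.

```r
# Hypothetical raw state values with inconsistent case, spacing and punctuation
states <- c("MAHARASHTRA", "Maharashtra", "KARNATAKA", "kerala",
            "Karnataka ", "maharashtra.")

# Normalise: lower-case, drop punctuation and digits, collapse and trim whitespace
clean <- tolower(states)
clean <- gsub("[[:punct:][:digit:]]", "", clean)
clean <- trimws(gsub("\\s+", " ", clean))

# Count occurrences and sort from most to least frequent
freq <- sort(table(clean), decreasing = TRUE)
freq
```

Without the normalisation step, "MAHARASHTRA", "Maharashtra" and "maharashtra." would be counted as three different states, which is the kind of skew the pre-processing is meant to prevent.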

library(readr)
library(dplyr)
library(tm)
library(wordcloud)
library(ggplot2)


# 'ship.state' is the column of interest
# Convert to lowercase for consistency
combined_sale_products$ship.state <- tolower(combined_sale_products$ship.state)

# Create a text corpus
corpus <- Corpus(VectorSource(combined_sale_products$ship.state))

# Optional: preprocess the text
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

# Generate the word cloud
wordcloud(corpus, max.words = 100, random.order = FALSE, rot.per = 0.35, colors=brewer.pal(8, "Dark2"))

The word cloud provides a visual representation of shipping locations from the dataset, with the size of each word reflecting its frequency. Larger words like “maharashtra,” “uttarpradesh,” and “telangana” indicate that these states are common shipping destinations. Smaller words represent less frequently occurring states. This visual summary quickly communicates which regions receive the most shipments and which receive the fewest.

Enhancements Used in Qualitative Analysis: BE1, BE4, BE5

Flow Diagram

This diagram outlines the full process: loading the datasets, combining them, pre-processing, fitting the different models, comparing those models, and generating the various plots used in the analysis.

flowchart TB
    load_datasets{Load Datasets} -->|read.csv| dataset1[Amazon Sale Report]
    load_datasets -->|read.csv| dataset2[Products]
    load_datasets -->|read.csv| dataset3[Amazon Products]
    dataset1 --> merge1[Combine Datasets]
    dataset2 --> merge1
    dataset3 --> merge1
    merge1 --> preprocess{Preprocess Data}
    preprocess -->|mutate| status_bin[Convert Status to Binary]
    status_bin --> fit_models{Fit Models}
    fit_models --> glm_model[GLM Binomial]
    fit_models --> poisson_model[Poisson GLM]
    fit_models --> gaussian_model[Gaussian GLM]
    glm_model --> compare{Compare Models}
    poisson_model --> compare
    gaussian_model --> compare
    compare -->|ggplot| aic_plot[AIC Comparison]
    preprocess -->|filter & mutate| clean_dataset[Clean Dataset for Additional Analysis]
    clean_dataset --> lm_model[Linear Regression Model]
    lm_model -->|ggplot| interaction_plot[Interaction Effect Plot]
    preprocess -->|aggregate| sales_over_time[Aggregate Sales Over Time]
    sales_over_time -->|plotly| sales_plot[Total Sales Over Time Plot]
    preprocess -->|gsub & as.numeric| ratings_hist[Convert Ratings to Numeric]
    ratings_hist -->|plotly| ratings_distribution[Histogram of Ratings Distribution]
    load_datasets -->|leaflet| map_visualization[Map Visualization]
    load_datasets -->|tm & wordcloud| word_cloud[Generate Word Cloud]
   

Enhancements Used: BE1, BE5, SE1

Shipping Destinations by State: An Overview of Indian Logistics

library(leaflet)
library(dplyr)

# Define the states and their coordinates
states_coords <- data.frame(
  name = c("Andaman & Nicobar Islands", "Andhra Pradesh", "Arunachal Pradesh", "Assam", 
           "Bihar", "Chandigarh", "Chhattisgarh", "Dadra and Nagar Haveli", "Delhi", 
           "Gujarat", "Haryana", "Himachal Pradesh", "Jammu & Kashmir", "Jharkhand", 
           "Karnataka", "Kerala", "Ladakh", "Madhya Pradesh", "Maharashtra", "Manipur", 
           "Meghalaya", "Mizoram", "Nagaland", "Odisha", "Puducherry", "Punjab", 
           "Rajasthan", "Sikkim", "Tamil Nadu", "Telangana", "Tripura", "Uttar Pradesh", 
           "Uttarakhand", "West Bengal"),
  latitude = c(11.667025, 15.9129, 28.2180, 26.2006, 25.0961, 30.7333, 21.2787, 20.1809, 
               28.7041, 22.2587, 29.0588, 31.1048, 33.7782, 23.6102, 15.3173, 10.8505, 
               34.1526, 22.9734, 19.7515, 24.6637, 25.4670, 23.1645, 26.1584, 20.9517, 
               11.9416, 31.1471, 27.0238, 27.5330, 11.1271, 18.1124, 23.9408, 26.8467, 
               30.0668, 22.9868),
  longitude = c(92.735983, 79.7400, 94.7278, 92.9376, 85.3131, 76.7794, 81.8661, 73.0169, 
                77.1025, 71.1924, 76.0856, 77.1734, 76.5762, 85.2799, 75.7139, 76.2711, 
                77.5770, 78.6569, 75.7139, 93.9063, 91.3662, 92.9376, 94.5624, 85.0985, 
                79.8083, 75.3412, 74.2179, 88.5122, 78.6569, 79.0193, 91.9882, 80.9462, 
                79.0193, 87.8550)
)

# Initialize a Leaflet map
map <- leaflet() %>%
  addProviderTiles(providers$OpenStreetMap) %>%  # Add base map tiles
  setView(lng = 78.9629, lat = 20.5937, zoom = 5)  # Center the map on India

# Add a marker for each state (addMarkers is vectorised, so no loop is needed)
map <- map %>%
  addMarkers(data = states_coords, lng = ~longitude, lat = ~latitude,
             popup = ~name)

# Display the map
map

The map illustrates the various shipping destinations across India, with one marker per state or union territory; clicking a marker shows its name.
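The markers above come from a fixed list of coordinates, so they do not yet reflect how many orders went to each state. As a rough sketch of how the two could be linked, the base R code below counts orders per state and joins the counts onto a coordinate table. The `coords` data frame is a three-row excerpt of `states_coords`, `ship_state` is toy data, and the lower-cased `state` join key is an assumption about how the names would be matched.

```r
# Three-row excerpt of the coordinate table (same values as states_coords)
coords <- data.frame(
  name = c("Maharashtra", "Telangana", "Uttar Pradesh"),
  latitude = c(19.7515, 18.1124, 26.8467),
  longitude = c(75.7139, 79.0193, 80.9462)
)

# Toy stand-in for the lower-cased ship.state column (illustrative values only)
ship_state <- c("maharashtra", "maharashtra", "telangana",
                "uttar pradesh", "maharashtra")

# Count orders per state; responseName sets the name of the count column
orders <- as.data.frame(table(state = ship_state),
                        responseName = "orders",
                        stringsAsFactors = FALSE)

# Join on a lower-cased key so "Maharashtra" matches "maharashtra"
coords$state <- tolower(coords$name)
state_orders <- merge(coords, orders, by = "state", all.x = TRUE)

# States with no orders come through the left join as NA; treat them as zero
state_orders$orders[is.na(state_orders$orders)] <- 0
state_orders
```

A table like `state_orders` could then be passed to leaflet, for example with `addCircleMarkers()` and a radius based on `orders`, so that marker size conveys shipping volume.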

Enhancements Used: BE1, BE5, SE3

Interactive Graphs/Plots

Graph 1

The code transforms a ‘Date’ column to the appropriate Date format and aggregates sales data by date to analyze trends over time. It then leverages plotly to create an interactive line chart.

library(readr)
library(dplyr)
library(plotly)

# Convert Date column to Date type
combined_sale_products$Date <- as.Date(combined_sale_products$Date, format = "%m/%d/%Y")

# Aggregate sales by date
sales_over_time <- combined_sale_products %>%
  group_by(Date) %>%
  summarise(TotalSales = sum(Amount, na.rm = TRUE))

# 'sales_over_time' is the data frame with Date and TotalSales columns
plot <- plot_ly(data = sales_over_time, x = ~Date, y = ~TotalSales, type = 'scatter', mode = 'lines+markers') %>%
  layout(title = 'Total Sales Over Time',
         xaxis = list(title = 'Date'),
         yaxis = list(title = 'Total Sales (INR)'))

# Print the plot
plot

The graph depicts daily sales totals over time, which can support business analysis and decision-making based on sales patterns.
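Because `as.Date()` returns `NA` for any value that does not match the supplied format, it is worth checking how many rows failed to parse before aggregating. The sketch below uses toy data and base R's `aggregate()` in place of the dplyr pipeline; the `sales` data frame and its values are illustrative, and the format string is the same one used in the code above.

```r
# Toy data; the date layout matches the "%m/%d/%Y" format string used above
sales <- data.frame(
  Date = c("04/30/2022", "04/30/2022", "05/01/2022", "not a date"),
  Amount = c(100, 250, 80, 75)
)

# Values that do not match the format become NA instead of raising an error
parsed <- as.Date(sales$Date, format = "%m/%d/%Y")

# Count the rows that failed to parse
n_failed <- sum(is.na(parsed))
n_failed

# Aggregate by date; rows with an NA date are dropped by aggregate()
sales$Date <- parsed
daily <- aggregate(Amount ~ Date, data = sales, FUN = sum)
daily
```

If `n_failed` is large on the real data, the format string probably does not match the way dates are stored in the CSV file and should be adjusted before plotting.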

Graph 2

The code cleans the ‘ratings’ column by stripping out commas and other non-numeric characters, then converts the cleaned strings to numeric values so they are ready for analysis. Using Plotly, an interactive histogram is created to display the distribution of product ratings.

# Convert ratings to numeric by removing every character except digits and the decimal point (this also strips commas)
combined_sale_products$ratings <- as.numeric(gsub("[^0-9.]", "", combined_sale_products$ratings))

# Interactive histogram of the ratings distribution, built from the cleaned column
plot <- plot_ly(data = combined_sale_products, x = ~ratings, type = "histogram") %>%
  layout(title = 'Distribution of Product Ratings',
         xaxis = list(title = 'Ratings'),
         yaxis = list(title = 'Count'))

# Print the plot
plot

This histogram allows for an intuitive exploration of the ratings’ frequency, showing how often each rating occurs within the dataset. This visualization helps in understanding customer satisfaction levels.
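To make the cleaning step concrete, here is a small base R sketch of how the regular expression behaves on a few toy strings. The `raw_ratings` values are invented for illustration, and the 0 to 5 rating scale used for the range check is an assumption rather than something confirmed by the dataset.

```r
# Toy rating strings (illustrative; not taken from the real dataset)
raw_ratings <- c("4.2", " 3.9 ", "Get", "1,234", "")

# Keep only digits and the decimal point; everything else is removed
cleaned <- gsub("[^0-9.]", "", raw_ratings)

# Empty strings become NA when converted
numeric_ratings <- as.numeric(cleaned)
numeric_ratings

# Assumed 0 to 5 scale: drop missing values and anything out of range
valid <- numeric_ratings[!is.na(numeric_ratings) & numeric_ratings <= 5]
valid
```

Entries with no digits at all, such as "Get", end up as `NA`, while a value like "1,234" survives as 1234, so a range check of this kind helps keep stray numbers out of the histogram.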

Enhancements Used: BE1, BE5, SE4